Abstract

Background: PubChem is a free and publicly available resource containing substance descriptions and their 
associated biological activity information. PubChem3D is an extension to PubChem containing 
computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a 
part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models 
currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the 
molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in 
a loss of the ability to reproduce experimentally determined "bioactive" conformations, for example, found for PDB 
ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of 
molecules. 

Results: The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 
small molecules whose structures were experimentally determined, and the conformers in each conformer model 
were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The 
accuracy of the conformer models before and after clustering was evaluated using five different measures: 
root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (ST^{ST-opt}) and combo-Tanimoto (ComboT^{ST-opt}),
and color-optimized color-Tanimoto (CT^{CT-opt}) and combo-Tanimoto (ComboT^{CT-opt}). On average,
clustering decreased the conformer model accuracy, increasing the conformer ensemble's RMSD to the bioactive
conformer (by 0.18 ± 0.12 Å), and decreasing the ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt} scores
(by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively).

Conclusion: This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating 
as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear 
degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely 
expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and
1.13, in terms of ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt}, respectively. This expected accuracy improves
linearly as the molecule becomes smaller or less flexible. 



Background 

The advent of combinatorial chemistry and high- 
throughput screening technology has made it possible 
to perform a rapid test of biological activity on a vast 
number of small molecules, generating a massive 
amount of biological activity data. While this explosion 
of information presents scientists with great opportun- 
ities to facilitate the identification of potential drug 
candidates and chemical probes, its benefit is enhanced 
when this data is combined with that of others and
made available to all. Dissemination of such information
requires a public repository that collects and stores the 
heterogeneous data from various contributors. An 
example of such a repository is PubChem [1-4]
(http://pubchem.ncbi.nlm.nih.gov), launched in 2004 as a
component of the Molecular Libraries Roadmap Initiatives
of the U.S. National Institutes of Health. PubChem 
archives biological activity screening data and other in- 
formation from diverse data sources and offers its 
contents free of charge to the biomedical research com- 
munity, facilitating research that benefits human health. 

PubChem consists of three primary databases: 
Substance, Compound, and BioAssay. The PubChem 
Substance database contains sample descriptions provided 






© 2013 Kim et al.; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited. 



Kim et al. Journal of Cheminformatics 2013, 5:1 
http://www.jcheminf.com/content/5/1/1






by individual depositors and the PubChem Compound 
database contains the unique standardized chemical struc- 
ture contents extracted from the PubChem Substance 
database. The PubChem BioAssay database contains 
biological assay descriptions and results. As of June 2012, 
PubChem contains more than 92 million substance 
descriptions, 32 million unique small molecules, 600 thou- 
sand biological assays, and 170 million biological assay 
outcomes (each outcome is a set of results from a sub- 
stance being tested in an assay). PubChem provides 
search, analysis, and download tools for the efficient use of 
this vast amount of chemical information. Many of these 
tools exploit the concept of molecular similarity at some 
level. One method in which PubChem evaluates chemical 
similarity between two molecules is to use a two- 
dimensional (2-D) dictionary-based fingerprint [5] and the 
Tanimoto equation [6,7]: 



$$\mathrm{Tanimoto} = \frac{AB}{A + B - AB} \qquad (1)$$



where A and B are the respective counts of the set binary 
fingerprint bits for the two molecules and AB is the count 
of set bits in common to both molecules. Because the 2-D 
molecular similarity computation is very fast (typically at a 
rate of one million pair-wise comparisons per second per 
CPU core), it is appropriate for searching a large database 
like PubChem. However, there are many diverse chemical 
structures with similar biological efficacies against targets 
available in PubChem that can be difficult to interrelate 
using traditional 2-D similarity methods [8-11]. To assist 
in biological activity analysis of these molecules, a new 
layer called PubChem3D [8-15] was added to PubChem. 
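
As a minimal sketch of Equation (1), the function below computes the Tanimoto score for two fingerprints represented as Python sets of set-bit positions. The set representation and the handling of two empty fingerprints are assumptions made for illustration; they are not taken from the PubChem implementation.

```python
def tanimoto(bits_a, bits_b):
    """Equation (1): AB / (A + B - AB) for two sets of set-bit positions."""
    a = len(bits_a)            # A: number of bits set in the first fingerprint
    b = len(bits_b)            # B: number of bits set in the second fingerprint
    ab = len(bits_a & bits_b)  # AB: number of bits set in both fingerprints
    denominator = a + b - ab
    # Assumption: two empty fingerprints score 0.0 to avoid dividing by zero.
    return ab / denominator if denominator else 0.0


# Four bits vs. three bits with two in common: 2 / (4 + 3 - 2) = 0.4
print(tanimoto({1, 2, 3, 4}, {3, 4, 5}))
```
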

PubChem3D generates a 3-D conformer model de- 
scription for each record in the PubChem Compound 
database, when it satisfies the following conditions [13]: 

(1) is not too large (50 or fewer non-hydrogen atoms);
(2) is not too flexible (no more than 15 rotatable bonds);
(3) has only a single covalent unit (i.e., is not a salt or mixture);
(4) consists of only supported elements (H, C, N, O, F, Si, P, S, Cl, Br, and I);
(5) contains only atom types recognized by the Merck Molecular Force Field (MMFF94s) [16,17]; and
(6) has five or fewer undefined atom (R,S) and bond (E,Z) stereo centers.
This 3-D description can be employed to enhance existing PubChem
search and analysis methodologies by means of 3-D 
similarity [10], helping the user identify useful structure- 
activity relationships that might go unrecognized by the 
PubChem 2-D similarity method. A diverse conformer 
ordering [10] gives a maximal description of the con- 
formational space of a molecule when only a subset of 
available conformers is used. A pre-computed search per 
compound record gives immediate access to a set of 3-D 
similar compounds (called "Similar Conformers" [8]) in 



PubChem and their respective superpositions, augmen- 
ting the complementary "Similar Compounds" relation- 
ship, computed using the PubChem 2-D similarity method. 
Systematic augmentation of PubChem resources to include 
a 3-D layer provides users with new capabilities to search, 
subset, visualize, analyze, and download data [10]. 
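
The six eligibility conditions for a PubChem3D conformer model listed earlier in this section can be sketched as a simple filter. The `CompoundSummary` record and its field names are hypothetical and exist only for this illustration; in practice the counts would come from a cheminformatics toolkit.

```python
from dataclasses import dataclass

# Elements supported according to condition (4) in the text.
SUPPORTED_ELEMENTS = frozenset(
    {"H", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"}
)


@dataclass(frozen=True)
class CompoundSummary:
    """Hypothetical per-compound summary; all field names are assumptions."""
    heavy_atoms: int
    rotatable_bonds: int
    covalent_units: int
    elements: frozenset
    mmff94s_typed: bool          # True if every atom has an MMFF94s atom type
    undefined_stereo_centers: int


def eligible_for_3d(c):
    """Return True when all six conditions from the text are satisfied."""
    return (
        c.heavy_atoms <= 50                      # (1) not too large
        and c.rotatable_bonds <= 15              # (2) not too flexible
        and c.covalent_units == 1                # (3) single covalent unit
        and c.elements <= SUPPORTED_ELEMENTS     # (4) supported elements only
        and c.mmff94s_typed                      # (5) MMFF94s atom types only
        and c.undefined_stereo_centers <= 5      # (6) limited undefined stereo
    )
```
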

All the tools and services in PubChem3D rely upon 
the quality and applicability of the computationally- 
derived 3-D conformer models of small molecules. Con- 
sidering the size of PubChem, all these conformer 
models by necessity must be pre-computed and stored 
to allow the user "real-time" access to identify structur- 
ally similar conformers and to analyze biological activity 
patterns. Among many different conformer generation 
programs that exist [18-24], PubChem3D uses the 
OMEGA C++ toolkit [25-28] to generate conformer 
ensembles. In our previous study [13], an optimal set of 
adjustable parameters was determined to maximize the
"accuracy" of OMEGA (i.e., the ability to reproduce 
experimentally-determined "bioactive" conformations, 
for example, found in protein-ligand complexes). Using 
experimentally determined structures of 25,972 small- 
molecule ligands found in the Protein Databank (PDB) 
[29], the effects of parameter values used in OMEGA 
upon the root-mean-square distance (RMSD) between 
the computationally-derived conformer models and their 
experimentally-determined bioactive conformations were 
analyzed in terms of the non-hydrogen (heavy) atom 
count (N_NHA) and effective rotor count (N_ER), as measures
of molecular size and flexibility, respectively [30].
Note that N_ER is given by the following equation and
takes into account molecular flexibility due to rotatable 
bonds and ring flexibility: 



$$N_{ER} = N_R + \frac{N_{NARA}}{5} \qquad (2)$$



where N_R is the number of rotatable bonds, and N_NARA
is the number of non-aromatic sp3-hybridized ring
atoms. The root-mean-square distance (RMSD) accuracy 
of the computationally-derived conformer models was 
found to strongly depend on molecular size and flexibil- 
ity, leading to the following formula [13] that estimates 
the worst-case RMSD accuracy of nearly all the conformer
models using only the values of N_NHA and N_ER:



$$RMSD_{pred} = 0.219 + 0.0099 \times N_{NHA} + 0.040 \times N_{ER} \qquad (3)$$



where RMSD_pred is the predicted upper limit of the
RMSD accuracy to ensure at least 90% of conformer 
models generated by OMEGA using the selected 
PubChem parameter set for the 25,972 PDB ligands had 
at least one "bioactive" conformer whose RMSD distance 






from the experimentally determined conformation was 
closer than the value predicted using Equation (3). 
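
Equations (2) and (3) can be sketched as two small functions. This assumes that the effective rotor count adds one fifth of the non-aromatic sp3 ring atom count to the rotatable bond count; the function and parameter names are illustrative and not part of PubChem3D.

```python
def effective_rotor_count(n_rotatable_bonds, n_nonaromatic_sp3_ring_atoms):
    """Equation (2), assuming N_ER = N_R + N_NARA / 5."""
    return n_rotatable_bonds + n_nonaromatic_sp3_ring_atoms / 5


def predicted_rmsd(n_heavy_atoms, n_effective_rotors):
    """Equation (3): predicted upper limit of RMSD accuracy, in angstroms."""
    return 0.219 + 0.0099 * n_heavy_atoms + 0.040 * n_effective_rotors


# Example: 20 heavy atoms, 4 rotatable bonds, one saturated five-membered ring.
n_er = effective_rotor_count(4, 5)        # 4 + 5/5 = 5.0
print(round(predicted_rmsd(20, n_er), 3))
```

For this hypothetical molecule the terms are 0.219 + 0.198 + 0.200, so the predicted limit is about 0.62 Å.
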

Using this accuracy-calibrated OMEGA parameter set, 
PubChem3D generates up to 100,000 conformers for 
each chemical structure stereo configuration. However, 
it is still not feasible to store all the conformers in a 
database and use them in a very efficient way. Therefore, 
the conformers in each ensemble are sampled through 
clustering with the RMSD_pred value, after rounding to
the nearest 0.2 increment as defined in the following 
equation, 



$$RMSD_{cluster} = \frac{\mathrm{int}(0.5 + RMSD_{pred} \times 5)}{5} \qquad (4)$$



where "int( )" gives the whole number, irrespective of
any remaining fraction, and where this RMSD threshold
is referred to as RMSD_cluster to emphasize its usage for
clustering purposes (rather than accuracy prediction).
Each sampled conformer represents a cluster containing 
all conformers within the designated RMSD threshold, 
thus reducing the count of conformers per conformer 
model. If the conformer model after cluster sampling 
has more than the maximum of 500 conformers, it is re-clustered
using an RMSD_cluster value incremented by a further 0.2.
This process is repeated as many times as
necessary to reduce the overall conformer count to 500 
or less. Although this clustering process makes the con- 
former models more manageable in size and better sui- 
ted for a large database such as PubChem, it may be 
accompanied by an undesirable loss of overall accuracy
of the conformer model. Therefore, in the present
study, we investigated the effect of the conformer model
clustering upon the accuracy of the conformer models
as a follow-up to our previous study [13] in order to address
key questions as to the performance of PubChem3D
sampled conformers in reproducing "bioactive" ligand
geometries: as a function of molecular size and flexibil- 
ity, with respect to the established PubChem3D similar- 
ity measures, and with an eye towards their expected 
performance relative to biological activity data analysis. 
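
The rounding of Equation (4) and the repeated re-clustering described above can be sketched as follows. This assumes Equation (4) divides the truncated value by 5 (so that thresholds fall on 0.2 Å steps) and that each retry re-clusters the full ensemble; the text does not say whether the retry starts from the full or the already sampled ensemble. The `cluster_fn` callable is a hypothetical stand-in for the actual RMSD clustering routine.

```python
def rmsd_cluster_threshold(rmsd_pred):
    """Equation (4): round RMSD_pred to the nearest 0.2 angstrom increment."""
    # int() truncates, so adding 0.5 first rounds to the nearest whole fifth.
    return int(0.5 + rmsd_pred * 5) / 5


def sample_conformers(conformers, threshold, cluster_fn,
                      max_conformers=500, step=0.2):
    """Cluster, loosening the threshold by `step` until the model is small enough.

    `cluster_fn(conformers, threshold)` is a hypothetical callable that returns
    one representative conformer per cluster.
    """
    sampled = cluster_fn(conformers, threshold)
    while len(sampled) > max_conformers:
        # round() keeps the threshold on clean 0.2 steps despite float error.
        threshold = round(threshold + step, 1)
        sampled = cluster_fn(conformers, threshold)
    return sampled, threshold


# RMSD_pred = 0.617 gives int(0.5 + 3.085) / 5 = 3 / 5 = 0.6
print(rmsd_cluster_threshold(0.617))
```
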

Results and discussion 

Molecular size and flexibility of the MMDB ligands 

This study considers 47,123 small molecules with experi- 
mental 3-D coordinates available from the Molecular 
Modeling Database (MMDB) [31] deposition in PubChem 
(Additional file 1). The molecular connectivity of these 
MMDB ligands is derived from the 3-D coordinates of the 
protein-bound small molecules taken from PDB [29] 
records. Note that the "experimental" structures of small 
molecules in PDB are known to, at times, have non-trivial 
issues or uncertainty concerning their precise chemical 
identity, protein binding geometry, or crystal structure 



location [13,32-36]. The present study largely ignores such 
potential issues and considers all the 3-D ligand structures 
as experimental facts. 

The effects of conformer ensemble clustering upon the 
accuracy of the conformer ensemble were analyzed as a 
function of N_NHA and N_R (as measures of molecular size
and flexibility, respectively). Additionally, N_ER [Equation
(2)] was also employed to represent molecular flexibility.
Although the value of N_ER is not necessarily an integer, it
was rounded to the nearest integer in the present study.
Figure 1 shows the distributions of the values of N_NHA,
N_R, and N_ER for the experimental structures for the 47,123
MMDB ligands considered. On average, the structures 
had 18.3 ± 10.5 non-hydrogen (heavy) atoms, 4.7 ± 3.5 
rotatable bonds, and 5.4 ± 3.7 effective rotors. Approximately
90% of the ligands had fewer than 31 non-hydrogen
atoms, 9 rotatable bonds, and 10 effective rotors.

PubChem3D generates a maximum of 100,000 confor- 
mers per compound stereo configuration for efficiency 
considerations. Reaching this "100-K" limit suggests a 
loss in conformational space considered, possibly resulting
in less accurate conformer models [13]. As shown in
Figure 2, the fraction of the molecules hitting the 100-K
limit increases rapidly as the molecule becomes larger
and more flexible. This suggests that, beyond 25 heavy
atoms and six rotatable bonds, exploration of conformational
space in some PubChem3D conformer models
may be truncated due to this limitation.

Figure 1 Molecular size and flexibility of bioactive ligand data
set. Frequency of (a) the non-hydrogen (heavy) atom count and (b)
the rotatable bond count and the effective rotor count for the
47,123 experimentally determined "bioactive" ligand structures in the
MMDB data set. The effective rotor counts were binned to the
nearest whole numbers.

RMSD clustering threshold 

After generating conformers for each molecule, Pub- 
Chem3D samples the conformers in each ensemble, 
using an RMSD threshold determined according to 
Equation (3) and Equation (4). Figure 3 shows the distribution
of the RMSD_cluster values used to cluster the conformers
in the conformer ensemble for the 47,123
MMDB ligands considered in the present study. The
RMSD_cluster values range from 0.4 Å to 2.2 Å (in discrete
0.2 Å increments), with an average and standard deviation
of 0.75 Å ± 0.33 Å. Approximately 85% of the 47,123 PDB
ligands have an RMSD_cluster value of < 1.0 Å.

The distributions of the RMSD "accuracy" of the resul- 
ting conformer ensembles to the experimental PDB ligand 
geometries are shown in Figure 4 and their averages and
standard deviations are summarized in Table 1. It is important
to note that, for the purpose of this study, the
RMSD "accuracy" value of a computationally-derived conformer
ensemble to the experimental "bioactive" conformation
is defined as the single best (i.e., least) non-hydrogen
atom-pairwise RMSD between the experimentally determined
3-D conformation of the PDB ligand and a 3-D conformer
in the ensemble. This RMSD "accuracy" should
not be confused with the clustering RMSD (RMSD_cluster).

Figure 2 The 100-K limit cases vs. molecular size and flexibility.
The fraction of the 47,123 MMDB ligands reaching the limit of
100,000 conformers per compound during the conformer
generation step as a function of: (a) the non-hydrogen atom count
and (b) the rotatable bond count and the effective rotor count.
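
The RMSD "accuracy" of an ensemble, defined in the text as the least heavy-atom RMSD between the experimental conformation and any conformer in the ensemble, can be sketched as below. The sketch assumes that each conformer has already been superposed on the experimental structure and that atoms are listed in the same order; a real implementation would also need optimal alignment and handling of symmetry-equivalent atoms.

```python
import math


def rmsd(coords_a, coords_b):
    """Atom-pairwise RMSD between two equally ordered lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def rmsd_accuracy(experimental, ensemble):
    """Least RMSD between the experimental structure and any ensemble conformer."""
    return min(rmsd(experimental, conformer) for conformer in ensemble)
```

Because only the single best conformer counts, removing conformers through clustering can never improve this value; it either stays the same or gets worse.
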

As expected, the PubChem3D conformer sampling 
procedure results in a loss of the conformer ensemble 
RMSD accuracy relative to experiment. On average, this 
overall loss is 0.18 Å (from 0.39 Å to 0.57 Å). The standard
deviation of this average also increases by 0.12 Å
(from 0.24 Å to 0.36 Å) and may reflect the rounding of
RMSD_cluster to the nearest 0.2 increment, potentially suggesting
the ±0.1 nature of such a change. In the aggregate,
90% of all conformer models in this study reflect
RMSD accuracies better than 1.1 Å after clustering.

In the study by Hawkins et al. [18], an RMSD value of 
1.25 Å or less was employed as the definition of a "close"
reproduction of the experimental conformation. They
also pointed out that an RMSD of 2.0 Å could have been
used as a cut-off because it is a common upper bound 
for successful reproduction of an experimental structure 
in molecular docking. With these criteria in mind, the 
after-clustering conformer models in the present study 
may be considered to be of high quality, although the 
choice of the cut-off for "close" reproduction of the ex- 
perimental structure is still arbitrary and subjective. 

Figure 5 illustrates the percentage of the conformer 
models with accuracy better than the RMSD_cluster value
(i.e., with an RMSD accuracy value less than RMSD_cluster)
before and after the conformer sampling procedure. For
comparison purposes, those conformer models with
accuracy better than RMSD_cluster + 0.1 Å are also included
to show the effects of the rounding of RMSD_pred to the
nearest 0.2 (i.e., RMSD_cluster). It is important for two reasons
to note that more than 90% of the conformer models
before clustering have an RMSD accuracy better than
RMSD_cluster. Firstly, this shows that, although the majority
of conformer models with more than 25 heavy atoms and 
six rotatable bonds hit the 100-K limit in conformer 
generation as shown in Figure 2, there is no significant 
adverse effect on the conformer model accuracy beyond 
that already reflected by Equation (4). Secondly, this indi- 
cates the resilience of OMEGA to generate biologically 
relevant conformers, favoring a breadth-first exploration 
of conformer space as a function of energy threshold 
(i.e., considering low-energy conformational spaces first),
thus ensuring an even coverage of conformer space up to 
the PubChem3D 100-K conformer limit in the conformer 
generation phase. For the remaining cases where the conformer
models before clustering are not more accurate
than the RMSD_cluster threshold, OMEGA does not even
come close to reproducing the experimental geometry
using the PubChem3D choice of parameters, considering
that an increase of 0.1 Å from RMSD_cluster does not find
many additional "missed" pre-clustered conformer models,
as Figure 5 shows. Whereas potential reasons for this
are numerous, one can likely attribute it to: improper perception
of atom hybridization or charge state from PDB
atom coordinates in the MMDB deposition (leading to an
inaccurate chemical structure, as PDB records tend not to
include hydrogen atom or bond order information); some
combination of errors, uncertainty, or omissions in the
PDB ligand information (as mentioned previously); or a
general inability of OMEGA to reproduce some "bioactive"
3-D chemical structure configurations.

Figure 3 RMSD_cluster conformer model sampling threshold. The frequency of the RMSD clustering threshold (RMSD_cluster) values used during
the conformer clustering procedure for the 47,123 MMDB ligand conformer models.
Figure 4 Overall RMSD accuracy of the conformer models. The
RMSD accuracy (binned in 0.1 Å increments) of the 47,123 MMDB
ligand conformer models to the corresponding experimental 3-D
structure, before and after the conformer model clustering
procedure, by frequency and cumulative % frequency.



As shown in Figure 5, after clustering, the fraction of the
conformer models with accuracy better than RMSD_cluster
decreased by no more than 12% in general, except for
RMSD_cluster = 1.6 Å (34%). When the conformer model
accuracy was predicted in a more conservative way using
the limit RMSD_cluster + 0.1 Å (rather than RMSD_cluster), the
difference between the conformer models with accuracy
better than this limit before and after clustering was no
more than 6%, except for RMSD_cluster = 1.6 Å (23%), showing
that the realized sampling effects are local in nature.
It appears that most of the cases of decreased conformer
model accuracy at RMSD_cluster = 1.6 Å are simply
due to an unfortunate culmination of pronounced
partition-based clustering edge-effects for a set of flexible
di- and tri-phosphate containing structures. In other
words, for these particular computationally-derived conformer
models, the conformations most similar to the experimental
structures happened to be near the boundaries of
the clusters generated with RMSD_cluster = 1.6 Å, and
therefore, they were not included in the conformer models
after the clustering procedure. As a result, clustering
a given conformer model using a different conformer
ordering with the same clustering procedure could have
yielded results closer to the pre-clustering result.



Table 1 Summary statistics of overall conformer model 
MMDB ligand conformer models to the corresponding experimental 3-D
structure before and after conformer sampling clustering. 
Figure 5 Conformer models with accuracy better than RMSD_cluster. The fraction of the conformer models of the 47,123 MMDB ligands with
an RMSD to the corresponding experimental 3-D structure less than the RMSD clustering threshold (RMSD_cluster) (solid line) and RMSD_cluster + 0.1 Å
(dashed line).



Figure 6 illustrates the cumulative % distribution of
the RMSD accuracy of the conformer models for each
discrete value of RMSD_cluster. As mentioned above, the
RMSD_cluster value determination using Equation (3) was
intended to ensure that 90% of conformer models have
an RMSD accuracy below RMSD_cluster before sampling;
however, the RMSD accuracy after sampling to ensure
90% of conformers are found is expected to be within
the range of RMSD_cluster ± 0.1, when considering the
effects of the rounding of RMSD_cluster to the nearest increment
of 0.2 [as performed in Equation (4)]. If one
looks across the 90% line in panel (a) of Figure 6, the
RMSD accuracies of 90% of the conformer models before
clustering are smaller than RMSD_cluster for the entire range
in general, with almost no difference at RMSD_cluster =
1.4 Å. The 90% levels of the after-clustering RMSD accuracies
[in panel (b) of Figure 6] are within the expected
range in general, except for RMSD_cluster = 1.4 Å and 1.6 Å,
where the RMSD accuracy for 90% of the conformer models
is not reached until 1.6 Å and 1.8 Å, respectively.

Figure 6 Accuracy of conformer models as a function of RMSD_cluster. The cumulative % distribution of the RMSD accuracy
(binned in 0.1 Å increments) of the 47,123 MMDB ligand conformer models to the corresponding experimental 3-D structure as a function of
RMSD clustering threshold (RMSD_cluster): (a) before clustering and (b) after clustering. [Note the three conformer models at 2.2 Å were removed
from this and some other figures for clarity].

One readily notices for each RMSD_cluster value in
Figure 6 that conformer model clustering shifts the cumulative
% distribution curves toward the right-hand
side, indicating a decrease in the conformer model accuracy
as a result of the PubChem sampling procedure.
Looking at the 90% level of conformer models before
and after clustering, there is some variation in the
change of the conformer model accuracy, depending on
the RMSD_cluster value. For example, the difference between
the RMSD accuracies at the 90% level before and
after clustering with RMSD_cluster of 0.4 Å and 0.6 Å is
0.1 Å (0.25 Å vs. 0.35 Å for RMSD_cluster = 0.4 Å, and
0.5 Å vs. 0.6 Å for RMSD_cluster = 0.6 Å). However, for the
RMSD_cluster values between 0.6 Å and 1.6 Å, the corresponding
differences range between 0.2 Å and 0.3 Å. In
general, it is very reassuring to see that most of these
conformer models at the 90% level are within the
expected range for most RMSD_cluster values. Although
sampling by its very nature will increase the distance between
conformers, this increase does not appear to
severely impact the accuracy of the conformer models in
PubChem.
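
The "90% level" readings discussed above can be illustrated with a small sketch that takes a list of per-model RMSD accuracies and reports the value reached by a given percentage of models, together with the percentage of models within a chosen limit. The function names are hypothetical, and the sketch works on raw values rather than on the 0.1 Å bins used in the figures.

```python
def level_reached_by(values, percent=90):
    """Smallest value v such that at least `percent` % of `values` are <= v."""
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    needed = -(-percent * len(ordered) // 100)
    return ordered[needed - 1]


def percent_within(values, limit):
    """Percentage of models whose accuracy value is at or below `limit`."""
    return 100.0 * sum(v <= limit for v in values) / len(values)
```

For RMSD, lower values are better, so these functions read the cumulative curve from the left. For the similarity scores introduced in the next section, higher values are better and the comparison would be reversed.
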

Comparison of ensemble accuracy measures 

Evaluation of the conformer model accuracy using 
RMSD is an intuitive and convenient choice, as the 
conformer model clustering in PubChem3D uses an 
RMSD value as a clustering threshold; however, in 
practice, PubChem3D primarily uses three measures 
in 3-D similarity comparison between molecules: 
shape-Tanimoto (ST), color-Tanimoto (CT), and combo- 
Tanimoto (ComboT) [37-40]. Therefore, the present study 
also employed PubChem3D similarity measures as add- 
itional conformer model accuracy measures. The ST 
[37-40] similarity measure, which quantifies the shape 
similarity between molecules, is defined as the following 
equation:

$$ST = \frac{V_{AB}}{V_{AA} + V_{BB} - V_{AB}} \qquad (5)$$

where V_AA and V_BB are the respective self-overlap volumes of
the two molecules, and V_AB is the overlap volume between
the two molecules. The CT [37,38] similarity measure, on
the other hand, evaluates the pharmacophore feature
similarity between molecules, by comparing the 3-D
orientation of fictitious atoms (also called feature atoms)
representing six functional group types (hydrogen-bond
donors, hydrogen-bond acceptors, cations, anions,
hydrophobes, and rings) by means of the equation:

$$CT = \frac{\sum_f V_{AB}^f}{\sum_f V_{AA}^f + \sum_f V_{BB}^f - \sum_f V_{AB}^f} \qquad (6)$$



where the index "f" is one of the six functional-group
types, V_AA^f and V_BB^f are the self-overlap volumes for the
functional group type "f" of the two molecules, respectively,
and V_AB^f is the overlap volume for the functional
group type "f" between the molecules. The ComboT
[37,38] similarity measure, which is defined as the arith- 
metic sum of the ST and CT scores, allows one to consider 
the two different similarities simultaneously. Because both 
the ST and CT scores range from 0 (for no similarity) to 
1 (for identical molecules), the ComboT score ranges 
from 0 to 2 (without normalization, due to pre-existing 
convention). 
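
Given precomputed overlap volumes, the three similarity scores can be sketched as below. The overlap volumes themselves come from a shape-overlay program and are simply inputs here. The CT function assumes that Equation (6) sums each overlap term over the feature types before forming the ratio, and the dictionary keys are hypothetical labels for those types.

```python
def shape_tanimoto(v_aa, v_bb, v_ab):
    """Equation (5): ST from self-overlap and mutual overlap volumes."""
    return v_ab / (v_aa + v_bb - v_ab)


def color_tanimoto(v_aa, v_bb, v_ab):
    """Equation (6), assuming per-feature-type volumes are summed first.

    Each argument maps a feature-type label to an overlap volume.
    """
    sum_aa = sum(v_aa.values())
    sum_bb = sum(v_bb.values())
    sum_ab = sum(v_ab.values())
    return sum_ab / (sum_aa + sum_bb - sum_ab)


def combo_tanimoto(st, ct):
    """ComboT is the plain sum of ST and CT, so it ranges from 0 to 2."""
    return st + ct


st = shape_tanimoto(10.0, 10.0, 8.0)                    # 8 / 12
ct = color_tanimoto({"donor": 2.0, "ring": 3.0},
                    {"donor": 2.0, "ring": 3.0},
                    {"donor": 1.0, "ring": 2.0})        # 3 / 7
print(round(combo_tanimoto(st, ct), 3))
```
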

The present study used two different approaches 
to compute these three 3-D similarity scores: the 
shape-optimized (or ST-optimized) approach and feature- 
optimized (or CT-optimized) approach. In the shape- 
optimized approach, the superposition of two molecules is 
optimized to have a maximum ST score and then the CT 
score is computed in that shape-optimized alignment. In 
the feature-optimized approach, the color and shape of 
the two conformers are considered simultaneously to
find the best superposition between them, as in the 
current version of ROCS [37]. In the present paper, the 
shape-optimized and feature-optimized methods are 
denoted using the superscripts "ST-opt" and "CT-opt", 
respectively. As a result, there are six different 3-D 
similarity scores (i.e., ST^{ST-opt}, CT^{ST-opt}, ComboT^{ST-opt},
ST^{CT-opt}, CT^{CT-opt}, and ComboT^{CT-opt}). Along with
RMSD, four of these six scores are used to analyze
the accuracy of the clustered conformer models relative
to the experimentally determined 3-D geometries:
ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt}.

As shown so far in this study, the conformer sampling 
procedure decreases ensemble accuracy to reproduce ex- 
perimentally determined ligand geometries, resulting in 
an increase in the RMSD values. This loss in accuracy is 
also seen for the four 3-D similarity values as shown in 
Table 1. On average, whereas the clustering increases 
the RMSD value of the conformer ensemble by 0.18 ± 
0.12 Å, it decreases the ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt},
and ComboT^{CT-opt} scores by 0.04 ± 0.03, 0.16 ± 0.09,
0.09 ± 0.05, and 0.15 ± 0.09, respectively. Although the
ComboT^{CT-opt} values are (in the aggregate) slightly
greater than the ComboT^{ST-opt} values, the decreases of
the two similarity measures upon clustering are nearly
identical in magnitude to each other, suggesting a
general insensitivity of the ComboT scores to the
optimization type. Perhaps more interesting is the relatively
small change in the ST^{ST-opt} average of 0.04,
whereas the CT^{CT-opt} average difference is more than
twice as large, indicating a much greater sensitivity of
CT^{CT-opt} to clustering. This is not surprising, as shape is
less discriminating than features (e.g., with a nitrogen
atom and carbon atom being nearly identical from a
shape perspective but completely different in their capability
to make intermolecular interactions). As shown in
Figure 7, after the conformer sampling procedure, 90% of
all the conformer models had accuracies better than 0.75,
1.09, 0.43, and 1.13, in terms of ST^{ST-opt}, ComboT^{ST-opt},
CT^{CT-opt}, and ComboT^{CT-opt}, respectively.

Figures 8, 9, 10, 11, and 12 show the five different measures
of the conformer ensemble accuracy used (i.e., RMSD,
ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt}) as a
function of molecular size and flexibility. [A further breakdown
of RMSD and ST^{ST-opt} values as a function of
molecular size and flexibility and the correlation between
RMSD and ST^{ST-opt} can be found in Additional file 2:
Figures S1-S7]. The linear nature of these curves demon- 
strates a clear association of the average PubChem3D 
conformer model accuracy with molecular size and flexibil- 
ity. Least-squares fitting to the form of "y = a + bx" for each 



data series in the plots from panel (d) in Figures 8, 9, 10, 
11, and 12 is summarized in Table 2. The least-squares fit- 
ting was also performed for the other data series in panels 
(a-c) of Figures 8, 9, 10, 11, and 12, but reported in 
Additional file 3 for brevity. As the RMSD_cluster value (as
well as N_NHA, N_R, and N_ER) increases, all five conformer
model accuracies before and after clustering change linearly.
With the notable exception of N_R, all R^2 values for
these fits were greater than 0.90. In the case of N_R, not
taking into account the flexibility of rings reduces the R^2
values to as low as 0.78. (In fact, the primary motivation of 
the development of N ER [30] was to account for "noise" in 
linear fits of N R such as these, to properly account for 
molecules that are effectively more flexible than their rotatable
bond count would suggest.) While the RMSD and ST^{ST-opt}
average accuracy measures did linearly correlate
with the RMSD_cluster value (Figures 8 and 9), for
ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt}, the difference
between before and after clustering accuracy values
did not always linearly correlate with the RMSD_cluster value
[namely: Figure 10, panels (a,d); Figure 11 panels (a-d); and
Figure 12 panels (a,d)]. In the case of CT^{CT-opt}, the average
differences appear to plateau just below 0.2, suggesting that
there may be some maximum error as a result of the PubChem
conformer clustering procedure. Echoes of this
Figure 7 Overall 3-D similarity accuracy of the conformer models. The
accuracy (binned in 0.05 increments) of the 47,123 MMDB ligand conformer
models to the corresponding experimental 3-D structure, before and after
the conformer model clustering procedure, by frequency and cumulative %
frequency for the 3-D similarity metrics: (a) ST^{ST-opt},
(b) ComboT^{ST-opt}, (c) CT^{CT-opt}, and (d) ComboT^{CT-opt}.

Figure 8 Average RMSD accuracy as a function of the molecular size,
flexibility, and RMSD_{cluster}. The average conformer model RMSD
accuracy of the 47,123 MMDB ligand conformer models to the corresponding
experimental 3-D structures, before and after the conformer model
clustering procedure, as a function of: (a) the non-hydrogen atom count,
(b) the rotatable bond count, (c) the effective rotor count, and (d) the
RMSD clustering threshold (RMSD_{cluster}).

Figure 9 Average ST^{ST-opt} accuracy as a function of the molecular
size, flexibility, and RMSD_{cluster}. The average conformer model
shape-optimized shape-Tanimoto (ST^{ST-opt}) accuracy of the 47,123 MMDB
ligand conformer models to the corresponding experimental 3-D structure,
before and after the conformer model clustering procedure, as a function
of: (a) the non-hydrogen atom count, (b) the rotatable bond count,
(c) the effective rotor count, and (d) the RMSD clustering threshold
(RMSD_{cluster}).

Figure 10 Average ComboT^{ST-opt} accuracy as a function of the molecular
size, flexibility, and RMSD_{cluster}. The average conformer model
shape-optimized combo-Tanimoto (ComboT^{ST-opt}) accuracy of the 47,123
MMDB ligand conformer models to the corresponding experimental 3-D
structure, before and after the conformer model clustering procedure, as
a function of: (a) the non-hydrogen atom count, (b) the rotatable bond
count, (c) the effective rotor count, and (d) the RMSD clustering
threshold (RMSD_{cluster}).

Figure 11 Average CT^{CT-opt} accuracy as a function of the molecular
size, flexibility, and RMSD_{cluster}. The average conformer model
color-optimized color-Tanimoto (CT^{CT-opt}) accuracy of the 47,123 MMDB
ligand conformer models to the corresponding experimental 3-D structure,
before and after the conformer model clustering procedure, as a function
of: (a) the non-hydrogen atom count, (b) the rotatable bond count,
(c) the effective rotor count, and (d) the RMSD clustering threshold
(RMSD_{cluster}).

Figure 12 Average ComboT^{CT-opt} accuracy as a function of the molecular
size, flexibility, and RMSD_{cluster}. The average conformer model
color-optimized combo-Tanimoto (ComboT^{CT-opt}) accuracy of the 47,123
MMDB ligand conformer models to the corresponding experimental 3-D
structure, before and after the conformer model clustering procedure, as
a function of: (a) the non-hydrogen atom count, (b) the rotatable bond
count, (c) the effective rotor count, and (d) the RMSD clustering
threshold (RMSD_{cluster}).



Table 2 Linear behavior of average conformer model accuracy as a function of RMSD_{cluster} value

Accuracy measure   Data series   a       b       σ_a     σ_b     σ_y    R^2
RMSD               Before        -0.06    0.63   0.087   0.061   0.11   0.93
                   After         -0.07    0.90   0.083   0.059   0.11   0.97
                   Difference    -0.01    0.27   0.045   0.032   0.06   0.90
ST^{ST-opt}        Before         1.06   -0.15   0.020   0.014   0.03   0.94
                   After          1.06   -0.21   0.025   0.018   0.03   0.95
                   Difference    -0.00    0.06   0.012   0.009   0.02   0.85
ComboT^{ST-opt}    Before         2.13   -0.53   0.057   0.040   0.07   0.96
                   After          1.99   -0.59   0.088   0.062   0.11   0.92
                   Difference     0.14    0.06   0.060   0.043   0.08   0.19
CT^{CT-opt}        Before         1.10   -0.34   0.038   0.027   0.05   0.95
                   After          1.01   -0.36   0.050   0.035   0.06   0.93
                   Difference     0.09    0.02   0.036   0.025   0.05   0.07
ComboT^{CT-opt}    Before         2.16   -0.54   0.058   0.041   0.07   0.96
                   After          2.03   -0.60   0.091   0.064   0.12   0.92
                   Difference     0.13    0.06   0.065   0.046   0.08   0.17

Results of linear least-squares fitting of the average conformer model
accuracies vs. the RMSD clustering threshold (RMSD_{cluster}) to the form
"y = a + bx". The sigma values (σ_a, σ_b, and σ_y) correspond to the
standard deviation of the fit to the predicted "a", "b", and "y" values.
The "x" values are the discrete RMSD_{cluster} values and the "y" values
are the corresponding average accuracy measures found in the data series
from panel (d) in Figures 8, 9, 10, 11, and 12.
appear to be present in the difference statistics for any
3-D similarity measure that involves the CT measure
(i.e., ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt}).
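
As a hedged illustration of the fitting summarized in Table 2, the
following minimal Python sketch performs an ordinary least-squares fit to
"y = a + bx" and reports a, b, σ_a, σ_b, σ_y, and R^2. It assumes the
textbook definitions of these quantities (σ_y taken as the residual
standard deviation with n - 2 degrees of freedom); whether the study used
exactly these definitions is an assumption. The names fit_line and
predict and the demonstration data are hypothetical. Only the
coefficients passed to predict are taken from Table 2.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x.

    Returns a dict with a, b, their standard errors (sigma_a, sigma_b),
    the residual standard deviation (sigma_y), and R^2.
    """
    n = len(xs)
    if n < 3 or n != len(ys):
        raise ValueError("need at least three (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sigma_y = math.sqrt(ss_res / (n - 2))
    sigma_b = sigma_y / math.sqrt(sxx)
    sigma_a = sigma_y * math.sqrt(sum(x * x for x in xs) / (n * sxx))
    r2 = 1.0 - ss_res / syy if syy > 0 else 1.0
    return {"a": a, "b": b, "sigma_a": sigma_a,
            "sigma_b": sigma_b, "sigma_y": sigma_y, "r2": r2}


def predict(a, b, x):
    """Evaluate the fitted line y = a + b*x."""
    return a + b * x


# Hypothetical demonstration data (NOT values from this study):
# x = RMSD_cluster thresholds, y = average accuracy at each threshold.
x_demo = [0.4, 0.6, 0.8, 1.0, 1.2]
y_demo = [0.30, 0.46, 0.66, 0.84, 1.00]
demo_fit = fit_line(x_demo, y_demo)

# Using the Table 2 "After" RMSD coefficients (a = -0.07, b = 0.90),
# the fitted average RMSD accuracy at RMSD_cluster = 0.8.
rmsd_after_at_0_8 = predict(-0.07, 0.90, 0.8)
```

Under this reading of Table 2, the "After" fit gives an average RMSD
accuracy of about 0.65 Å at an RMSD_{cluster} value of 0.8.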

Taking this all into account, in general, conformer model clustering
increases the conformer RMSD value and decreases the four 3-D Tanimoto
values, indicating a reduced accuracy of the conformer ensemble due to
conformer clustering. The difference between the ensemble accuracies
before and after clustering increases with the values of N_{NHA}, N_R,
and N_{ER}, implying that the effects of conformer clustering become more
noticeable in bigger and more flexible molecules. This is expected,
because the RMSD_{cluster} value gets larger for such molecules
[Equation (3)]. As compared in Figures 9 and 11, the average conformer
CT^{CT-opt} accuracy values show a larger decrease upon clustering than
the average conformer ST^{ST-opt} values, meaning that the conformer
CT^{CT-opt} values are more sensitive to clustering than the conformer
ST^{ST-opt} values. However, conformer clustering decreases the average
ComboT^{ST-opt} and average ComboT^{CT-opt} values (Figures 10 and 12) by
a similar amount, again showing the insensitivity of the ComboT value to
the optimization type. A similar insensitivity of the ComboT value to the
optimization type was also observed in our previous studies [9,11], in
which the distribution of the ComboT^{ST-opt} scores between randomly
selected conformers was found to be very similar to that of the
ComboT^{CT-opt} scores.

What does this all mean? The average loss of accuracy of PubChem3D
conformer ensembles behaves in a predictable fashion, even after
sampling, as a function of molecular size and flexibility across the
PubChem3D similarity measures. The accuracy with which bioactive
conformers are reproduced degrades linearly both before and after the
sampling procedure, and the sampling procedure itself adds only a modest
amount of further degradation. For 90% of the PubChem3D ensembles, one
expects the worst-case minimum accuracy to be (as stated previously from
Figure 7) 0.75, 1.09, 0.43, and 1.13, in terms of ST^{ST-opt},
ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt}, respectively. This
expected minimum accuracy improves linearly as the molecule becomes
smaller or less flexible.

One may ask "how good or how bad are these worst-case minimum
accuracies?" To answer this question, it is necessary to determine an
appropriate cut-off value for a "close" reproduction of the experimental
structure. Our recent study [11], which examined the statistical
significance of the ROCS-based similarity scores, provides some clues on
an appropriate choice of the cut-off values. In that study [11], the
ROCS-based 3-D similarity scores between randomly selected,
biologically tested compounds were computed, and from the distribution of
these scores, conversion tables were generated that convert a ROCS-based
similarity score to the p-value of getting that particular score by
randomly selecting two biologically tested conformers. According to these
conversion tables, the p-value of getting a similarity score equal to the
worst-case minimum accuracy by selecting two random conformers is 0.019,
0.002, 0.003, and 0.002 for ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt},
and ComboT^{CT-opt}, respectively. If a significance level (α) of 0.05 is
employed, these p-values are small enough to reject the null hypothesis
of getting a particular 3-D similarity score by chance. Although this
interpretation depends on the significance level one chooses, it is still
true that these worst-case minimum accuracies of the conformer models
(0.75, 1.09, 0.43, and 1.13, for ST^{ST-opt}, ComboT^{ST-opt},
CT^{CT-opt}, and ComboT^{CT-opt}, respectively) are much greater than
one may expect from randomly selected conformer pairs (0.54 ± 0.10,
0.62 ± 0.13, 0.18 ± 0.06, and 0.59 ± 0.14, for ST^{ST-opt},
ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt}, respectively) [9,11],
implying structural similarity between the conformer model and the
experimental structure. Also note that this interpretation is consistent
with the fact that 90% of the conformer models considered in this study
have RMSD accuracies better than 1.1 Å, which is much tighter than the
common upper bound (an RMSD of 2.0 Å) for successful reproduction of an
experimental conformation in molecular docking, as mentioned above.
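
The comparison with randomly selected conformer pairs can be restated as
simple arithmetic. The sketch below expresses each worst-case minimum
accuracy as a number of standard deviations above the random-pair mean
quoted above. This is only an illustration: the p-values in this study
come from the conversion tables of [11], not from a normal approximation,
and the variable names used here are hypothetical.

```python
# Worst-case minimum accuracies for 90% of the conformer models (Figure 7).
worst_case = {
    "ST^{ST-opt}": 0.75,
    "ComboT^{ST-opt}": 1.09,
    "CT^{CT-opt}": 0.43,
    "ComboT^{CT-opt}": 1.13,
}

# Mean and standard deviation of scores between randomly selected
# conformer pairs, as quoted in the text from [9,11].
random_pairs = {
    "ST^{ST-opt}": (0.54, 0.10),
    "ComboT^{ST-opt}": (0.62, 0.13),
    "CT^{CT-opt}": (0.18, 0.06),
    "ComboT^{CT-opt}": (0.59, 0.14),
}

# Distance of each worst-case value from the random-pair mean,
# measured in random-pair standard deviations.
separation = {
    name: (worst_case[name] - mean) / sd
    for name, (mean, sd) in random_pairs.items()
}

for name, value in separation.items():
    print(f"{name}: {value:.1f} standard deviations above the random mean")
```

Every measure sits at least two standard deviations above its
random-pair mean, which agrees qualitatively with the small p-values
reported above.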

When it comes to biological activity data analysis, the present study
shows that there will be a definitive upper limit to the PubChem3D
conformer ensemble accuracy based on the molecular size and flexibility.
While the results of the present study consider all sampled conformers,
PubChem3D search and analysis tools use a diverse subset of sampled
conformers, where the diverse subset selection criterion is the ComboT
dissimilarity. [The reason for using the ComboT dissimilarity is that it
considers both the ST and CT dissimilarity simultaneously. While the
choice of the optimization type is somewhat arbitrary, our previous
studies [9,11] have shown that the ComboT score is not very sensitive to
the optimization type in the aggregate.] Using a diverse subset of
sampled conformers will likely decrease performance further, beyond that
reported in this study. In addition, one can expect that, as the desired
3-D Tanimoto threshold increases in a given biological activity analysis,
the ability to interrelate larger and more flexible molecules will
decrease, not because they necessarily lack common biologically
accessible conformer space, but because of the inherent similarity
distance between the stored sampled conformers. This analysis also
suggests that the use of a single "one-size-fits-all" similarity Tanimoto
threshold for PubChem3D molecules may not be an ideal choice for
conformer models sampled at different RMSD values. The results from this
study suggest that conformer sampling may exacerbate the molecular
size/flexibility dependency already present in conformer generation
software [13]. Smaller and less flexible molecules in PubChem3D will have
a tighter conformational sampling (with a smaller spacing between
conformers) than larger and more flexible molecules, and can therefore
interrelate more molecules at a given Tanimoto value. In addition,
smaller and less flexible molecules will have fewer sampled conformers in
their respective conformer ensembles and will likely suffer less of a
reduction in accuracy due to the use of a diverse subset. As a result, a
search using a smaller and less flexible molecule as a query is likely to
return more 3-D similar molecules than a search using a larger and more
flexible query molecule. Furthermore, even if a large or flexible
molecule is used as a 3-D similarity query, an increasing proportion of
the returned results is likely to be smaller or less flexible as the
Tanimoto value is increased. This potential bias towards conformer models
with smaller sampling distances may be worth further consideration and
study to develop a more reliable 3-D similarity-based biological activity
analysis method.

Effects of experimental uncertainties upon conformer 
model accuracies 

Like any experimentally derived measurements, the crystal structures
stored in PDB have uncertainties in their atomic coordinates, and the
interpretation of the accuracy of a computationally derived conformer
model should take into account the positional uncertainty of the
corresponding experimental ligand structure. For example, if the
positional uncertainty in the experimental structure is greater than the
RMSD value of the conformer model, a comparison between the experimental
and theoretical ligand structures is not particularly meaningful. The
average positional errors of atoms in a crystal structure can be
evaluated with the diffraction-component precision index (DPI) [41,42],
which can be approximated as proposed by Blow [43], using information
commonly contained in the header of a PDB file. In the study by Hawkins
et al. [18], crystal structures with a DPI of < 0.42 Å were considered to
be precise enough for use as a standard dataset for validation of
conformer generators, and in this way, conformer models with an RMSD
value of > 0.6 Å (= √2 × DPI [44]) were guaranteed to be meaningful
predictions.
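
The threshold quoted above follows from a one-line calculation. The
sketch below reproduces it, assuming only the √2 × DPI relation stated in
the text; the function names are hypothetical and are not part of any
crystallographic package.

```python
import math


def min_meaningful_rmsd(dpi):
    """RMSD (in Å) above which a comparison is meaningful: sqrt(2) * DPI."""
    return math.sqrt(2.0) * dpi


def is_meaningful(rmsd, dpi):
    """True if the model RMSD exceeds the coordinate-uncertainty floor."""
    return rmsd > min_meaningful_rmsd(dpi)


# DPI cut-off used by Hawkins et al. [18], as quoted in the text.
threshold = min_meaningful_rmsd(0.42)
print(f"sqrt(2) x 0.42 = {threshold:.3f} Å")
```

For a DPI of 0.42 Å the product is approximately 0.59 Å, which rounds to
the 0.6 Å value quoted in the text.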

Although the present study did not focus on potential issues concerning
the experimental uncertainties of PDB structures [13,32-36], it is still
valuable to test the conformer model accuracy against a set of highly
reliable experimental structures. Therefore, a set of 157 high-quality
ligand molecules (Additional file 4) was constructed from a set of 197
PDB structures recommended in the study by Hawkins et al. [18] (see the
Methods section). The distributions of the RMSD and ComboT^{CT-opt}
accuracies for these 157 structures are shown in Figure 13, and their
average and median values are compared in Table 3 with those from the
197 molecules considered in the study by Hawkins et al. [18]. [The
ComboT^{CT-opt} accuracy is identical to the Tanimoto Combo in their
study.] As shown in Table 3, when going from the 47,123-ligand set to the
157-ligand set, the average RMSD value of the conformer models increased
by 0.08 Å (from 0.57 Å to 0.65 Å) and the average ComboT^{CT-opt}
accuracy decreased by 0.06 (from 1.61 to 1.55). These differences do not
seem very meaningful, considering the standard deviations for the RMSD
and ComboT^{CT-opt} accuracies of the two sets.

Note that the average RMSD value of the 157-ligand set differed by only
0.02 Å from that of the 197-ligand set from the study of Hawkins et al.
(0.65 Å vs. 0.67 Å) [18]. The difference in the ComboT^{CT-opt} accuracy
between the two sets was 0.01 (1.55 and 1.56 for the 157- and 197-ligand
sets, respectively). Considering that our study
Figure 13 Distribution of the conformer model accuracies for the 157
high-quality ligands. Distribution of before- and after-clustering
accuracies of the conformer models from the 157 ligand molecules selected
from the 47,123 PDB ligands considered in the present study: (a) the RMSD
accuracy, and (b) the ComboT^{CT-opt} accuracy.
Table 3 Comparison of the average and median RMSD and ComboT^{CT-opt} values between different PDB ligand sets

                   47,123 Ligands                               157 Ligands                                  197 Ligands
Accuracy measure   Before clustering     After clustering       Before clustering     After clustering       (Ref. [18])
RMSD (Å)           0.39 (±0.24) / 0.30   0.57 (±0.36) / 0.50    0.40 (±0.26) / 0.33   0.65 (±0.31) / 0.61    0.67 / 0.51
ComboT^{CT-opt}    1.77 (±0.20) / 1.85   1.61 (±0.27) / 1.70    1.75 (±0.22) / 1.83   1.55 (±0.25) / 1.58    1.56 / 1.64

Numbers before and after the slash are the mean and median, respectively,
and numbers in parentheses are the standard deviations. The 47,123-ligand
set contains all the PDB ligand molecules considered in the present
study, and the 157-ligand set contains 157 high-quality ligand molecules
selected from the 47,123 compounds. The 197-ligand set contains 197
high-quality ligand molecules from the study by Hawkins et al. [18].



used OMEGA parameters different from those used in 
their study, the conformer model accuracies from the 
two studies do not seem very different. 

Conclusion 

In the present study, conformer ensembles for 47,123 PDB ligand molecules
from MMDB were computationally generated using the PubChem3D approach.
The accuracy with which these conformer models reproduce the
experimentally derived structures was investigated in terms of the RMSD
and the PubChem3D similarity scores (i.e., ST^{ST-opt}, ComboT^{ST-opt},
CT^{CT-opt}, and ComboT^{CT-opt}). The PubChem3D conformer sampling
procedure increased the RMSD value of the conformer ensemble by
0.18 ± 0.12 Å, and decreased the ST^{ST-opt}, ComboT^{ST-opt},
CT^{CT-opt}, and ComboT^{CT-opt} accuracies by 0.04 ± 0.03, 0.16 ± 0.09,
0.09 ± 0.05, and 0.15 ± 0.09, respectively (see Table 1), indicating a
decrease in the conformer ensemble accuracy in general. For all five
accuracy measures (RMSD, ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt}, and
ComboT^{CT-opt}), the conformer model accuracies before and after
clustering decreased linearly with the increase in the RMSD_{cluster}
value (as well as N_{NHA}, N_R, and N_{ER}), with R^2 values for the fits
to these curves greater than 0.91 (see Figures 8, 9, 10, 11, and 12 and
Table 2).

Whereas the change in the CT^{CT-opt} accuracy (0.09 ± 0.05) upon
clustering was much greater than the ST^{ST-opt} average difference
(0.04 ± 0.03), the ComboT^{ST-opt} and ComboT^{CT-opt} changes had
similar averages and standard deviations (0.16 ± 0.09 vs. 0.15 ± 0.09).
This implies that, in general, while the CT^{CT-opt} accuracy is more
sensitive to the clustering than the ST^{ST-opt} accuracy, the effect of
the clustering upon the ComboT accuracy is not sensitive to the
optimization type. Similarly, while the rate of decrease of the
ST^{ST-opt} accuracy with the increase in molecular size and flexibility
was much slower than that of the CT^{CT-opt} accuracy (Figure 9 vs.
Figure 11), the ComboT^{ST-opt} and ComboT^{CT-opt} accuracies decreased
at a similar rate (Figure 10 vs. Figure 12).

This study shows that there is a definitive limit in the ability of the
PubChem3D sampled conformer models to reproduce the bioactive
conformations found in PDB ligands. This study also suggests that larger
and more flexible molecules may be less able to interrelate with other
larger and more flexible molecules at a given Tanimoto value than smaller
and less flexible molecules are. [This is also supported by our recent
study [8] on the PubChem 3-D neighbors. The PubChem 3-D neighbors (also
known as "similar conformers") are defined as any two compounds that are
structurally similar (with ST^{ST-opt} > 0.8 and CT^{ST-opt} > 0.5), and
it was found that compounds without 3-D neighbors occur more frequently
among larger compounds than among smaller compounds. In addition, smaller
molecules tend to have more 3-D neighbors than larger molecules.] As a
result, one may want to consider such effects when performing a 3-D
similarity search or 3-D biological activity data analysis. Specifically,
for 90% of the PubChem3D conformer models, one can expect the worst-case
minimum accuracy to be 0.75, 1.09, 0.43, and 1.13, in terms of
ST^{ST-opt}, ComboT^{ST-opt}, CT^{CT-opt}, and ComboT^{CT-opt},
respectively (see Figure 7). These values are expected to improve
linearly as the molecules considered become smaller and less flexible. In
addition, these values may become worse if a diverse subset of sampled
conformers is used.



Methods 

Datasets 

The experimental 3-D structures of small molecules were downloaded from
the Molecular Modeling Database (MMDB) ligand dataset [45,46] as
available from the PubChem Substance database at the National Center for
Biotechnology Information (NCBI) (as of July 1, 2010). Ligands that were
too small or too large were discarded by limiting the non-hydrogen atom
count to 6-50. Ligands that were too flexible (with an effective rotor
count greater than 15) were also eliminated. This filtering stage
resulted in a set of 47,123 non-unique, organic (i.e.,
carbon-containing) 3-D experimental reference structures for which a 3-D
conformer model could be generated. The distributions of molecular size
and flexibility of the dataset are depicted in Figure 1.
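
The dataset filter described above can be sketched as a simple predicate.
This is a minimal illustration, assuming that the 6-50 atom-count range
is inclusive and that the descriptors are already available for each
record; the Ligand class and its field names are hypothetical and do not
correspond to any PubChem data structure.

```python
from dataclasses import dataclass


@dataclass
class Ligand:
    sid: int            # hypothetical record identifier
    n_nha: int          # non-hydrogen atom count (N_NHA)
    n_er: float         # effective rotor count (N_ER)
    has_carbon: bool    # True for organic (carbon-containing) ligands


def keep_ligand(lig):
    """Apply the size, flexibility, and organic filters from the text."""
    size_ok = 6 <= lig.n_nha <= 50      # assumed inclusive bounds
    flexibility_ok = lig.n_er <= 15     # discard if greater than 15
    return size_ok and flexibility_ok and lig.has_carbon


def filter_ligands(ligands):
    """Return only the ligands that pass all three filters."""
    return [lig for lig in ligands if keep_ligand(lig)]
```

For example, a ligand with 20 non-hydrogen atoms and an effective rotor
count of 4.2 passes, whereas one with 55 non-hydrogen atoms or an
effective rotor count of 15.4 does not.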

To test the effects of experimental uncertainties upon the evaluation of
the conformer model accuracies, a subset of the 47,123-ligand set
containing 157 "high-quality" ligand structures was constructed using the
procedure described below.

(1) Select the PubChem Substance records associated with the MMDB records
for the 197 PDB structures selected in the study by Hawkins et al. [18].
These PDB structures were chosen by considering the local quality of fit
of the ligand to its density, as well as global-level metrics of the
protein structure. Because some of the 197 PDB structures had multiple
ligands, there were 265 PubChem Substance records associated with these
protein structures. Note that, because their study provided a list of the
PDB identifiers (without a unique ligand identifier), it was difficult to
determine which ligands were actually included in the 197-ligand set.
Therefore, filtering steps similar to those used in their study were
applied next.

(2) Select the PubChem Substance records that are neither too rigid nor
too flexible (3 < N_R < 16), and that are neither too small nor too large
(8 < N_{NHA} < 50). This filtering stage resulted in 200 PubChem
Substance records.

(3) Select the PubChem Substance records with good "local" quality of fit
to the density. Hawkins et al. [18] used three metrics for this purpose:
the real-space correlation coefficient (RSCC) [47], the real-space
R-value (RSR) [48], and the occupancy-weighted B-factor (OWAB). In the
present study, the same criteria as used in their study (RSCC > 0.9,
RSR < 0.2, and 1 < OWAB < 50) were applied, after downloading these data
from the electron density server (EDS) [49,50]. After this step, 176
structures remained.

(4) Some of the remaining 176 PubChem Substance records were associated
with identical PubChem Compound records, or had the same three-letter PDB
ligand codes, implying that they were the same ligand molecule. In these
cases, the one with the largest RSCC value was retained, and the others
were removed. After removing this redundancy, 164 structures remained.

(5) When any pair of the remaining 164 structures had a PubChem 2-D
similarity score of > 0.9 (computed using the PubChem 2-D subgraph
fingerprints [5] and the Tanimoto equation [6,7]), the one with the
larger RSCC value was retained and the other was removed. [In the study
of Hawkins et al. [18], the LINGOS method [51] was used instead of the
PubChem fingerprint to remove molecules that were too similar.] After
this final filtering stage, 157 ligands remained.
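
Steps (2) through (5) above can be sketched as follows. This is an
illustrative simplification, not the procedure actually implemented in
the study: it merges steps (4) and (5) into a single greedy pass over the
records in order of decreasing RSCC, represents a fingerprint as a set of
"on" bit positions, and uses a hypothetical Record class whose field
names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    sid: int                 # PubChem Substance ID
    cid: int                 # PubChem Compound ID
    ligand_code: str         # three-letter PDB ligand code
    n_r: int                 # rotatable bond count
    n_nha: int               # non-hydrogen atom count
    rscc: float              # real-space correlation coefficient
    rsr: float               # real-space R-value
    owab: float              # occupancy-weighted B-factor
    fingerprint: frozenset   # set of "on" fingerprint bits (assumed form)


def size_flex_ok(r):
    """Step (2): neither too rigid/flexible nor too small/large."""
    return 3 < r.n_r < 16 and 8 < r.n_nha < 50


def density_ok(r):
    """Step (3): good local quality of fit to the electron density."""
    return r.rscc > 0.9 and r.rsr < 0.2 and 1 < r.owab < 50


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_high_quality(records, sim_cutoff=0.9):
    """Steps (2)-(5), keeping the highest-RSCC member of each group."""
    kept = [r for r in records if size_flex_ok(r) and density_ok(r)]
    kept.sort(key=lambda r: r.rscc, reverse=True)
    selected = []
    for r in kept:
        redundant = any(
            r.cid == s.cid
            or r.ligand_code == s.ligand_code
            or tanimoto(r.fingerprint, s.fingerprint) > sim_cutoff
            for s in selected
        )
        if not redundant:
            selected.append(r)
    return selected
```

Processing the records in order of decreasing RSCC guarantees that,
whenever two records are judged redundant, the one with the larger RSCC
value is the one retained, as the steps above require.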

Conformer generation using OMEGA 

The conformer ensemble for each molecule in the dataset was generated
using the OMEGA software [28] from OpenEye Scientific Software, Inc. The
OMEGA application performs conformer generation in two primary stages:
fragment generation and torsion driving. The fragment generation stage
splits the input structure into smaller pieces that are energy minimized
and conformationally sampled to get diverse 3-D representations for each
molecular fragment. The torsion driving stage reassembles and iterates
over the fragments from the first stage using particular rule-based
torsion angles that depend on the molecular environment between
connecting fragments. A more detailed description of the OMEGA software
is given elsewhere [18,52].

OMEGA has many adjustable parameters to generate 
conformations with particular attributes, and the optimal 
set of parameter values used for the present study was 
based on our recent study [13]. The Merck Molecular 
Force Field (MMFF94s) without coulombic terms 
(MMFF94s_NoEstat) was used with the "startfact" value 
of 20. The energy window value of 25 kcal/mol was 
employed for both model building and torsion driving 
stages. The values used for other parameters were iden- 
tical to those used in the previous study [13]. 

As pointed out in a recent review by Scior et al. [53], because adequate
conformational space coverage is an important requirement for reliable
3-D similarity computations, it would be desirable to consider as many
conformations per molecule as possible. However, because this would
require tremendous computational resources, a compromise between
computational cost and conformational coverage is unavoidable. The
PubChem3D conformer generation procedure generates a maximum of 100,000
conformers for each chemical structure. As demonstrated in our previous
study [13], this limit may not be enough for very flexible and large
molecules, resulting in truncation of the conformational search. However,
in the same study [13], it was shown that the 100-K limit does not cause a
noticeable decrease in the "average" conformer model accuracy for smaller
and less flexible molecules (i.e., with N_{NHA} < 35 and N_{ER} < 15).
Therefore, this 100-K limit seems adequate for these molecules in
general.

Clustering of conformer ensembles 

After conformer models were produced, a data reduction was performed
whereby conformers were sampled to identify a random set of conformers
that are separated from each other by at least a minimum RMSD distance.
This minimum RMSD distance was determined by rounding the RMSD_{pred}
value [in Equation (3)] to the nearest 0.2 increment [i.e.,
Equation (4)]. The conformers in each conformer ensemble were
down-sampled using a partition-based clustering scheme, as described in
our previous study [15], with the RMSD as a distance threshold between
conformers (that is, RMSD_{cluster}) and the lowest-energy conformer in
each partition as an initial "seed" structure for clustering of that
partition. The centroid of each cluster was selected as the
representative conformer of that cluster to construct a smaller conformer
model with 500 conformers or fewer. If the conformer model had more than
500 conformers after sampling, it was re-clustered with the
RMSD_{cluster} value incremented by a further 0.2. This re-clustering
process was repeated as many times as necessary to reduce the overall
conformer count to 500 or fewer. Note that, because the lowest-energy
conformer in each partition was used as an initial seed, low-energy
conformers are more likely to be included than high-energy conformers
when all partitions are combined for the next round of clustering. As a
result, the final conformer model sampled through clustering is more
likely to include low-energy conformers than high-energy conformers. All
RMSD values computed in this study used the OEChem OERMSD function with
"overlay" turned on, to allow rotation/translation to yield the lowest
possible RMSD value, and with "automorph" detection turned on, to allow
proper treatment of symmetrically equivalent atoms, except when its use
resulted in excessive run time [an extremely rare event (at a rate of
about 1 in 10,000) generally caused by large, nearly symmetric
molecules].
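
The threshold rounding and the re-clustering loop described above can be
sketched as follows. This is a minimal sketch under stated assumptions:
a greedy leader-style clustering stands in for the partition-based scheme
of [15] (it keeps the first conformer encountered rather than a cluster
centroid), the full ensemble is re-clustered on every pass, and the
conformers are assumed to be sorted by increasing energy. The function
names, and the idea of passing the RMSD calculation in as a callable, are
hypothetical.

```python
def round_to_increment(value, step=0.2):
    """Round value to the nearest multiple of step (e.g., 0.73 -> 0.8)."""
    return round(round(value / step) * step, 10)


def leader_cluster(conformers, distance, threshold):
    """Greedy stand-in for the clustering step.

    A conformer becomes a new representative only if it lies at least
    `threshold` away from every representative chosen so far.
    """
    representatives = []
    for conf in conformers:
        if all(distance(conf, rep) >= threshold for rep in representatives):
            representatives.append(conf)
    return representatives


def downsample(conformers, distance, rmsd_pred, max_conformers=500, step=0.2):
    """Cluster at the rounded threshold, widening it until the model fits."""
    if max_conformers < 1:
        raise ValueError("max_conformers must be at least 1")
    threshold = round_to_increment(rmsd_pred, step)
    reps = leader_cluster(conformers, distance, threshold)
    while len(reps) > max_conformers:
        threshold = round(threshold + step, 10)   # increment by a further 0.2
        reps = leader_cluster(conformers, distance, threshold)
    return reps, threshold
```

Because low-energy conformers are visited first in this sketch, they are
favored as representatives, which mirrors the low-energy bias noted in
the text, although by a different mechanism.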

Evaluation of ensemble accuracies 

The accuracy of the clustered ensembles was estimated using five
different accuracy measures: RMSD, ST^{ST-opt}, ComboT^{ST-opt},
CT^{CT-opt}, and ComboT^{CT-opt}. The latter four accuracy measures were
computed using ROCS [37,38,52]. Note that the generated conformer model
had up to 500 conformers, and the accuracy of the conformer model was
evaluated by selecting the best conformer that was closest to the
experimental structure (that is, the one with the smallest RMSD value or
the largest ROCS-based similarity values).
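
The "best conformer" selection described above can be sketched as
follows, assuming the per-conformer scores against the experimental
structure have already been computed (for example, with ROCS and an RMSD
routine). The ConformerScores class and its field names are hypothetical.
The sketch also assumes the usual PubChem3D convention, not restated in
this section, that ComboT is the sum of ST and CT evaluated at the same
superposition.

```python
from dataclasses import dataclass


@dataclass
class ConformerScores:
    rmsd: float        # RMSD to the experimental structure (Å)
    st_st_opt: float   # ST at the shape-optimized superposition
    ct_st_opt: float   # CT at the shape-optimized superposition
    st_ct_opt: float   # ST at the color-optimized superposition
    ct_ct_opt: float   # CT at the color-optimized superposition


def ensemble_accuracy(scores):
    """Best value of each accuracy measure over the conformer model.

    Each measure is optimized independently, so different conformers
    may supply the best value for different measures.
    """
    if not scores:
        raise ValueError("conformer model is empty")
    return {
        "RMSD": min(s.rmsd for s in scores),
        "ST^{ST-opt}": max(s.st_st_opt for s in scores),
        "ComboT^{ST-opt}": max(s.st_st_opt + s.ct_st_opt for s in scores),
        "CT^{CT-opt}": max(s.ct_ct_opt for s in scores),
        "ComboT^{CT-opt}": max(s.st_ct_opt + s.ct_ct_opt for s in scores),
    }
```

Applying this function to the ensembles before and after clustering, and
subtracting the results, would give the kind of "Difference" values
discussed in the Results, under the assumptions above.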

Additional files 



Additional file 4: List of the 157 high-quality ligands selected from 
the 47,123 ligands. This file contains a list of the PC Substance ID for 
157 high-quality ligands selected from the 47,123 molecules. 